Journal of Pathology Informatics

Elsevier BV

Preprints posted in the last 30 days, ranked by how well they match Journal of Pathology Informatics's content profile, based on 13 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.

1
Assessing Foundation Models for Computational Pathology in Endometrial Cancer

Volinsky-Fremond, S.; van den Berg, N.; Barkey Wolf, J.; Schoenpflug, L. A.; Andani, S.; Ortoft, G.; Jobsen, J. J.; Lutgens, L. C.; Powell, M. E.; Mileshkin, L. R.; Mackay, H.; Leary, A.; Razack, R. R.; de Bruyn, M.; de Boer, S. M.; Nout, R. A.; Smit, V. T.; Creutzberg, C. L.; Koelzer, V. H.; Bosse, T.; Horeweg, N.

2026-05-25 pathology 10.64898/2026.05.22.26353897 medRxiv
Top 0.1%
18.3%

Computational pathology leverages deep learning to extract clinically relevant information from digitized tumor slides, predicting histopathological subtypes, molecular alterations, and patient outcomes. Recent pipelines increasingly rely on foundation models trained on large pan-cancer datasets to generate generalizable features. In endometrial cancer (EC), their comparative performance for clinical diagnostic tasks remains unexplored. For the first time, this study evaluates the performance of seven state-of-the-art foundation models across morphological, molecular, and prognostic tasks using a large EC dataset of 3,293 patients from randomized trials and clinical cohorts. In addition, their performance was compared to one model (EsVIT) exclusively trained on EC. The foundation models H-OPTIMUS-0, CONCH, and VIRCHOW2 achieved the highest mean performance, but the best-performing foundation model varied by task. The top-performing foundation model outperformed the EC-specific feature extractor EsVIT across all tasks. This study highlights the superiority of foundation models over a domain-specific feature extractor in EC. Selecting the optimal foundation model for novel tasks remains challenging due to performance plateaus and limited information on the training datasets, requiring rigorous benchmarking and domain insight to reach maximum potential.

2
Unsupervised Tissue Concepts for Explainable Sarcoma Subtype Prediction from H&E

Bisson, T.; Ingram, D.; Singh, S.; Li, A.; Flynn, S.; Wang, W.-L.; Kim, A. E.; Bridge, C. P.; Demicco, E. G.; Sorrentino, A.; Jiang, S.; Hung, Y. P.; Lazar, A. J.; Iafrate, A. J.

2026-05-20 pathology 10.64898/2026.05.15.26353333 medRxiv
Top 0.1%
6.3%

Soft tissue sarcomas are a rare, heterogeneous group of tumors whose diagnosis remains challenging because of overlapping morphology and limited access to sarcoma-specialized pathologists. Although pathology foundation models have shown promise in computational pathology, their clinical translation remains limited by insufficient interpretability, particularly in diagnostically complex settings such as sarcoma diagnosis. Here, we developed and evaluated an H&E-based AI framework for sarcoma subtype classification that focused on explainability. Using the CONCH v1.5 foundation model, we computed embeddings from a tissue microarray cohort of 2,545 cases spanning 19 sarcoma subtypes and trained an attention-based multiple-instance learning (ABMIL) model that achieved a balanced accuracy of 77.38% (SD 1.88). To move explainability beyond attention-based localization, we trained a sparse autoencoder on patch-level embeddings to learn 768 recurring visual concepts. Ninety high-activation concepts were reviewed by three senior pathologists and curated into morphologically meaningful and non-meaningful categories, yielding a semantic dictionary of 41 diagnostically relevant tissue concepts. We then trained a linear attention-based model on the 768-concept vectors, which retained much of the performance of the raw embedding-based ABMIL model, achieving a balanced accuracy of 73.74% (SD 1.30). When restricting the linear model to pathologist-curated morphologic concepts only, balanced accuracy further decreased to 67.04% (SD 1.27), suggesting that the residual performance gain in the full concept model was driven by inconsistent, technical, or diagnostically irrelevant concepts. Concept-level explanations of the curated linear attention-based model aligned with known sarcoma morphology, including lipogenic, myxoid, spindle-cell, pleomorphic, vascular, small round blue cell, and matrix-forming patterns, and reproduced patterns of diagnostic overlap observed in human sarcoma pathology.
Together, these results show that H&E-based foundation-model representations capture meaningful diagnostic structure within the known limitations of H&E in sarcoma diagnostics, but that their clinical value depends on whether this structure can be made interpretable to pathologists. Sparse autoencoder-derived concepts can address this critical gap by converting embedding-level signal into recurring morphologic patterns that pathologists can review and name, providing the foundation to link these patterns to subtype predictions. In doing so, this approach turns concept discovery into a practical form of diagnostic explanation, while also revealing where model performance is supported by recognizable histopathology and where it relies on diagnostically irrelevant or inconsistent visual patterns.
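
The balanced accuracy figures quoted in the abstract above are, by the usual definition, the mean of per-class recall, which keeps rare subtypes from being swamped by common ones. The following is a minimal sketch of that metric only; the function name and the toy labels are hypothetical and are not taken from the preprint.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = defaultdict(int)  # correctly predicted cases per true class
    total = defaultdict(int)    # number of cases per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: a classifier that always predicts the majority class.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10
print(balanced_accuracy(y_true, y_pred))
```

In this toy case plain accuracy would be 0.8, while balanced accuracy is the average of recall 1.0 (common) and 0.0 (rare).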

3
Automatic Bevacizumab Response Prediction in Ovarian Cancer from Digital Pathology Images via Novel AI-based Computational Pipeline

Alsaiari, A.; Turki, T.; Taguchi, Y.-h.

2026-05-04 bioinformatics 10.64898/2026.04.29.721782 medRxiv
Top 0.1%
3.7%

Ovarian cancer is a gynecological cancer that can be fatal if it metastasizes and is not detected early. Therefore, there is a need to accurately predict drug response in ovarian cancer. A gynecological pathologist inspects tissue for abnormalities and then reports on the patient; however, this diagnostic process (1) is hard, (2) requires experience, and (3) is time consuming. Moreover, existing tools are far from perfect. Hence, we present a computational pipeline to improve drug response prediction in ovarian cancer, derived as follows. First, we download digital pathology images pertaining to ovarian bevacizumab response from The Cancer Imaging Archive repository. We apply histogram of oriented gradients to the images to construct feature vectors, which are provided to Fisher linear discriminant analysis to change the representation through dimensionality reduction. Then, we provide the reduced-dimensionality data for regression analysis through support vector regression coupled with various kernels and calculate the area under the ROC curve (AUC). Experimental results against transformer-based models (ViT and Swin) and other deep learning (DL) models (VGG16, ResNet50, InceptionV3, MobileNetV2, and EfficientNetB6) demonstrate that our approach with a radial kernel (named SVRD+R) yielded an AUC improvement of 17% over the best-performing transformer-based model (ViT) and an AUC improvement of 14.9% over the best DL-based model (MobileNetV2). These results demonstrate the superiority and feasibility of our AI-based pipeline when tackling prediction problems pertaining to gynecologic cancer studies. MSC: 92B05; 68T09

4
DigitAb: Domain-Adaptive Cell Type Prediction Method from Light Microscopy Images

Lucarelli, N.; Winfree, S.; Sabo, A.; Barwinska, D.; Ferkowicz, M.; Bowen, W.; Singh, A.; Chen, K.; Tatke, A.; Jen, K.-Y.; Eadon, M. T.; El-Achkar, T. M.; Jain, S.; Sarder, P.

2026-05-21 pathology 10.64898/2026.05.19.726313 medRxiv
Top 0.1%
3.6%

Light microscopy imaging with histological stains is central to disease diagnosis and research. It is enhanced with immunostaining to reveal cellular composition and complexity linked to clinical utility and biological mechanisms. Emerging multiplex imaging technologies like Phenocycler markedly increase the coverage to capture the cellular diversity but are costly, technically demanding, and inaccessible to most clinical laboratories. We developed DigitAb, a deep learning framework that classifies cell types directly from hematoxylin and eosin (H&E) stained slides, eliminating the need for specialized assays. Using Phenocycler imaging, we generated high-resolution ground truths for ~3.5 million cells from 29 human kidney samples across four multi-institutional datasets to train a semantic segmentation model for 10 cell types, achieving a balanced accuracy of 0.78. By employing an integrated adversarial domain adaptation module, we tested DigitAb on unlabeled and untested biopsy samples from kidney transplant and diabetic samples. We were able to predict several cell types just from histology images, without using any special technology or immunostains, and demonstrate high concordance with clinical gold-standard Banff schema in kidney transplant rejection, and clinical characteristics of diabetic nephropathy. Our cloud-based tool, DigitAb, provides scalable, accessible, label-free cellular segmentation for research and clinical pathology.

5
Bridging Cotyledon Pathology and Perfusion in Healthy Primate Pregnancy

Keding, L. T.; Liu, R.-Y.; Keding, T. J.; Vazquez, J.; Bockoven, C. G.; Shah, D. M.; Golos, T. G.; Wieben, O.; Stanic, A. K.

2026-05-21 pathology 10.64898/2026.05.18.726079 medRxiv
Top 0.1%
2.8%

Introduction: Healthy and diseased placentae alike often display some degree of pathology. However, quantitative techniques to characterize common pathologies and their relationship to local maternal hemodynamics in healthy primate placentae are currently limited. Methods: Placentae from seven rhesus macaques were imaged by MRI at three time points across mid- to late-gestation, to quantify placental blood volume, flow, and perfusion from maternal spiral arteries across pregnancy. Near term, we collected placental cotyledons, digitized hematoxylin/eosin-stained slides, then segmented and annotated sub-tissues and major pathologies (intervillous gaps, fibrin deposition, villous agglutination, inflammatory agglutination, and stromal mineralization) within each cotyledon. Individual pathologies were assessed in relation to each other and MRI perfusion metrics, in a cotyledon-specific manner. Parallel analyses were performed to investigate both basic (Spearman correlation) and animal variance-negated (dimensionality-reduction) relationships. Results: Cotyledons with increased stromal mineralization demonstrated low blood perfusion across pregnancy, alongside significant compensatory changes. Mineralization was further associated with decreased fetal weight, across all sub-tissues. Dimensionality reduction revealed maternal vascular malperfusion-associated pathologies as the largest contributor to dataset variance. Additionally, pathologies commonly associated with healthy placental function demonstrated low cotyledon blood flow and volume at all timepoints, with no evidence of compensatory changes across gestation. Conclusions: Comprehensive digital annotation revealed several relationships connecting pathology and maternal blood perfusion in the healthy primate pregnancy, at the smallest functional unit of the placenta.
This methodological framework embeds pathologist-refined morphological expertise into a quantitative, spatially resolved format that can ground, rather than be replaced by, unsupervised computational approaches to placental analysis.

6
Assessing the reliability of immunofluorescence image analysis with artificial intelligence

Bertin, D.; Bongrand, P.; Bardin, N.

2026-05-18 allergy and immunology 10.64898/2026.05.10.26352837 medRxiv
Top 0.2%
1.9%

In view of the outstanding progress of machine learning (ML) and growing cost of health systems, it is a current challenge to incorporate artificial intelligence tools into actual medical practice. Here we explored the feasibility and reliability of using machine learning to perform an important immunological investigation that currently requires experienced biologists: anti-neutrophil cytoplasmic antibodies (ANCAs) are important markers for vasculitis and they may be evidenced by microscopic examination of cells labeled with patients' sera. The use of a reliable ML classifier to discriminate between positive and negative samples would increase the rapidity and decrease the cost of immunofluorescence-based ANCA detection. Here, we tested seven well-documented ML algorithms, ranging from simple models such as k nearest neighbors to more complex convolutional neural networks involving millions of adjustable parameters. We studied the feasibility and reliability of classifying 1114 serum samples that had been collected for about 3 years and assayed with conventional procedure. We compared four strategies consisting of assaying either whole microscope fields or individual cell images, and natural images or histograms. The following conclusions were obtained: (i) Several different strategies allowed us to build models stable enough to discriminate between positive and negative samples collected during about 27 months, with a comparison to human classification yielding a kappa index of about 0.7, which may be considered as fairly good and intermediate between the performance of junior and senior biologists. (ii) Simpler ML models combined with theoretical thinking might provide the most rapid and efficient way of developing a reliable test within the framework of a single institution. (iii) In addition, the interpretability of the simplest model provided some theoretical insight into important classification parameters.
(iv) An important point and caveat is that the multiplicity and versatility of currently available tools make it an essential requirement to test repeatedly a given model, that must be chosen as simple as possible, to achieve a reliability compatible with medical use. It is concluded that our study provides a strong incentive to incorporate ML tools in well defined medical tests, which might reduce the risk of human errors and pave the way to fully automatic procedures.

7
Anatomy-Guided 3D Graph Networks for Couinaud Segmentation in Tumor Affected Livers

You, L.; Dang, H.; Wang, H.; Matta, E.; zhou, X.

2026-05-14 bioinformatics 10.64898/2026.05.11.724316 medRxiv
Top 0.2%
1.7%

Image-based liver Couinaud segmentation is designed to automatically provide the locations of suspicious objects in liver CT/MR images. Once achieved, the physicians will be guided to the target slice and area where the suspicious node is located. However, conventional algorithms trained primarily on healthy liver images often fail to generalize to hepatocellular carcinoma (HCC) cases due to pathological structural distortions. In this work, we propose a robust two-stage framework that integrates a 3D UNet with a 3D Anatomical Structure-Guided Graph Convolutional Network (3D GCN). This two-stage strategy effectively isolates the liver volume to eliminate structural noise from neighboring organs, such as the spleen, allowing the framework to focus exclusively on the complex 3D anatomical relationships among the eight segments. To ensure the topological consistency required for global spatial reasoning, we implement a standardized preprocessing pipeline that normalizes liver-only volumes to exactly 50 frames along the z-axis. By combining a lightweight 3D UNet backbone with the 3D GCN for refined boundary reasoning, our model demonstrates superior generalization performance on unseen clinical datasets, achieving a mean Dice score of 0.828 in blind testing. By releasing our code and pretrained weights, we aim to provide the first publicly available deep learning resource for robust Couinaud segmentation.

8
An Interactive Trustworthy AI Pathology Copilot to Improve Biomarker-Driven Prognostic Stratification and Therapeutic Response Prediction

Mao, Y.; Xie, C.; Li, F.; Li, D.; Zhang, W.; Zhang, Y.; Li, B.; Zhao, C.; Zhang, Z.; Tan, Y.; Cen, Z.; Tao, H.; Yang, J.; Wang, J.; Feng, Q.; Liu, B.; Liang, L.; Lu, C.; Zhang, Y.; Ning, Z.

2026-05-19 pathology 10.64898/2026.05.17.26352870 medRxiv
Top 0.2%
1.5%

Predictive assays for precision oncology increasingly rely on multi-scale biomarkers that manifest as morphologic signatures in routine whole-slide images (WSIs). However, most computational pathology models treat biomarker profiling and outcome prediction (i.e., prognostic stratification and therapeutic response) as independent tasks, and lack the interactive and trustworthy capabilities required for clinical translation. Here, we present TEAM, an interactive trustworthy AI pathology copilot that improves biomarker-driven outcome prediction. Pretrained on 55,648 pan-cancer WSIs and 1,750,648 regions of interest (ROIs), comprising 360 million patches, TEAM learns risk-aware embeddings by conditioning on clinical metadata and aligning with a relative risk prior. For trustworthy assessment, TEAM quantifies patch-level data (aleatoric) and model (epistemic) uncertainty, then propagates these estimates to patient-level predictions. In outcome prediction, profiled biomarkers serve as intermediate features to contextualize prognostic and therapeutic estimates. Beyond passive prediction, TEAM integrates vision-language models with agentic orchestration for clinical reasoning, and provides a web-based clinician-in-the-loop interface for interactive prediction refinement. Evaluated across 48 multi-institutional cohorts encompassing 85 benchmarks, TEAM consistently outperforms existing methods across biomarker profiling, prognostic stratification, and therapeutic response prediction, supporting trustworthy AI-assisted decision-making in computational pathology.

9
MurineCyto-Det: A High-Resolution Murine BALF Cytology Dataset for Leukocyte Segmentation and Detection

Le, T. X.; Tran, L.-A. T.; Farabi, D. A.; Wang, S.; Phan, A. T. Q.; Cormier, S. A.; Taada, A.; McGrew, D.; Du, Y.; Vu, L. D.

2026-05-12 bioinformatics 10.64898/2026.05.08.723893 medRxiv
Top 0.2%
1.3%

Automated analysis of murine bronchoalveolar lavage fluid (BALF) cytology is important for preclinical respiratory research, yet progress has been limited by the lack of publicly available, well-annotated mouse BALF image datasets. We present MurineCyto-Det, a high-resolution murine BALF cytology dataset comprising 333 image tiles of size 1024x1024 pixels, annotated across five cytological categories with both pixel-level segmentation masks and one-to-one matched bounding boxes. The dataset contains 14,551 annotated cell instances and supports two complementary analysis tasks: morphology-oriented cell segmentation and object-level cell detection. To establish reproducible benchmark baselines, we evaluated representative segmentation and detection models. The results demonstrate the practical utility of MurineCyto-Det while highlighting realistic challenges arising from class imbalance, small object size, irregular cell morphology, and ambiguous debris-like structures. MurineCyto-Det provides a standardized resource for developing, evaluating, and comparing automated methods for murine BALF cytology analysis. The dataset is publicly available at https://doi.org/10.5281/zenodo.17608677.

10
Dual-Stream Compression of High Bit-Depth Medical Images with Application to DNA Storage

Su, H.; Fan, W.; Peng, J.; Zhang, Y.

2026-05-20 bioinformatics 10.64898/2026.05.17.724501 medRxiv
Top 0.2%
1.2%

High bit-depth medical images preserve subtle intensity variations that are important for quantitative analysis and clinical interpretation, but their large dynamic range poses challenges for efficient compression. We propose a bit-plane-aware dual-stream compression framework for 16-bit medical images by separately modeling the most significant bit (MSB) and least significant bit (LSB) components. The MSB structural stream is encoded using JPEG coding with a Duplicate Segment Skipping (DSS) strategy to exploit spatial and segment-level redundancy, while the LSB detail stream is compressed using learned image compression to represent residual variations and fine-grained details. Experiments on four MRI and CT datasets show that the proposed method consistently outperforms representative traditional and learning-based codecs, achieving the lowest bit rate across all datasets. Meanwhile, it preserves high reconstruction fidelity. As a downstream application, we further demonstrate that the compressed bitstreams can be effectively integrated with DNA encoding and converted into sequences with favorable biochemical properties.
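
The MSB/LSB separation described in the abstract above can be illustrated with plain bit operations. The sketch below assumes the split falls at the byte boundary (upper 8 bits versus lower 8 bits); the abstract does not state the split point, and the function names are hypothetical. It shows only the lossless split and merge, not the JPEG, Duplicate Segment Skipping, or learned-compression stages.

```python
def split_bit_planes(pixels):
    """Split 16-bit pixel values into two 8-bit streams.

    Assumption: the MSB stream is the upper byte and the LSB stream is the
    lower byte; the preprint's actual split point is not given in the abstract.
    """
    for p in pixels:
        if not 0 <= p <= 0xFFFF:
            raise ValueError(f"pixel value {p} is outside the 16-bit range")
    msb = [(p >> 8) & 0xFF for p in pixels]  # coarse structural stream
    lsb = [p & 0xFF for p in pixels]         # fine detail stream
    return msb, lsb


def merge_bit_planes(msb, lsb):
    """Recombine the two 8-bit streams into the original 16-bit values."""
    return [(hi << 8) | lo for hi, lo in zip(msb, lsb)]


pixels = [0, 255, 256, 4095, 65535]  # toy 16-bit intensities
msb, lsb = split_bit_planes(pixels)
assert merge_bit_planes(msb, lsb) == pixels  # the split itself loses nothing
```

Because the split is exactly invertible, any reconstruction error in such a scheme would come from how each stream is subsequently coded, not from the separation step.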

11
Predicting bladder cancer molecular subtypes linked to bacillus Calmette-Guerin response from histology images using deep learning

Khoraminia, F.; Olislagers, M.; de Jong, F. C.; Akram, F.; Nakauma Gonzalez, A.; Lichtenberg, D.; Stubbs, A.; Costello, J. C.; Rijstenberg, L.; van Leenders, G. J. L. H.; Vrieling, A.; Aben, K. K. H.; Kiemeney, L. A. L. M.; Hoedemaeker, R. F.; Bangma, C. H.; Vermeulen, S.; Litjens, G.; Khalili, N.; Zuiverloon, T. C. M.

2026-05-06 oncology 10.64898/2026.05.05.26352375 medRxiv
Top 0.3%
1.0%

Background and objective: High-risk non-muscle-invasive bladder cancer (HR-NMIBC) is treated with transurethral resection and intravesical BCG instillations, yet approximately 50% recur and 20% progress to invasive disease. Although molecular subtyping, e.g., BCG-response-subtype (BRS), is associated with progression risk and may aid risk stratification, it is costly and time-consuming. Intratumoral heterogeneity complicates accurate subtyping. To address these challenges, we developed a deep-learning model that predicts BRS from routine hematoxylin-eosin-stained images. We verified the model's area-by-area predictions against tissue-level gene-expression maps. Methods and participants: Hematoxylin-eosin-stained images from 231 HR-NMIBC patients with known BRS were used to develop a deep-learning model through cross-validation, then validated in 83 independent samples. The model's spatial predictions were assessed using spatial transcriptomics to map gene expression to tissue locations in five HR-NMIBC tumors. Outcome measurements and statistical analysis: Discriminative ability for BRS3 vs. BRS1/2 was measured by AUC. Spatial alignment was assessed by calculating Pearson and Spearman correlation coefficients between model predictions and BRS fractions; significance was assessed through permutation analysis. Key findings and limitations: The trained algorithm achieved AUC of 0.79 (development) and 0.71 (external) to detect BRS3 vs. BRS1/2. Tile-level correlation between model output and molecular labels was significant (Pearson r = 0.33-0.44; p ≤ 0.002). Limitations include retrospective sampling and limited spatial transcriptomic cases. Conclusions and clinical implications: Our trained algorithm showed potential to stratify HR-NMIBC patients by clinically relevant BCG-response subtypes using routine hematoxylin-eosin-stained images and showed predicted spatial heterogeneity comparable to molecular profiling. Prospective validation is required before any clinical implementation.
Patient summary: Standard pathology images contain hidden details related to a tumor's molecular subtype. We trained an AI model to read these routine images and identify specific bladder cancer subtypes associated with poor response to BCG therapy. This approach may help reveal molecular subtype-associated information from routine pathology images, without additional laboratory procedures.

12
Multi-LLM Disagreement as a Scalable Detector of Human Annotation Errors in Structured Data from Clinical Free-Text

Wittlinger, S.; Meerjansen, J.; Wolf, F.; Wiest, I. C.; Ebert, M. P.; Siegel, F.; Belle, S.

2026-05-06 health systems and quality improvement 10.64898/2026.05.04.26352392 medRxiv
Top 0.3%
0.9%

Objective: Structured extraction from clinical free-text depends on human annotators whose labels are susceptible to errors and knowledge-driven mistakes; exhaustive quality control is impractical at scale. We evaluate whether disagreement among multiple locally hosted large language models (LLMs) can prioritize human annotations for targeted review. Methods: Multiple LLMs independently extract the same set of structured variables annotated by a human reviewer. For each annotation, an agreement score counts the LLMs matching the human label. Using four locally hosted LLMs (Gemma 3 27B, DeepSeek-R1 70B, GPT-OSS 120B, Mistral Large 3), we evaluated this approach on 910 German-language colonoscopy reports describing endoscopic mucosal resection, with five structured variables per case (anatomical location, two diameters, resection technique, multiple polyps), yielding 4,550 annotations and a 377-case adjudication sample. A stratified sample oversampling low-agreement strata was adjudicated blinded by an experienced reviewer and analyzed with prevalence-adjusted estimates. Results: Human error rates rose as LLM agreement fell, from 0% at scores 3-4 to 76% at score 0. The lowest-agreement stratum was only 6.5% of annotations yet concentrated an estimated 80% of errors. The multi-LLM disagreement score achieved a prevalence-adjusted AUC-ROC of 0.991 (95% CI 0.987-0.994) and AUC-PR of 0.893 (95% CI 0.851-0.929) for error detection. Discussion: Multi-LLM disagreement outperformed single models and provided graded operating points for risk-stratified review. Conclusion: Multi-LLM disagreement provides a scalable quality-control signal for targeted review of the highest-yield cases. Because all models run locally, the framework is GDPR-compliant; its language- and task-agnostic design supports application across clinical domains.
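
The agreement score in the abstract above (the number of LLMs whose extracted value matches the human label) is simple enough to sketch. The following is a minimal illustration, assuming exact string equality as the matching rule; the abstract does not say how values were normalized or compared, and the function names, record layout, and example values are hypothetical.

```python
def agreement_score(human_label, llm_labels):
    """Count how many LLM-extracted values equal the human annotation.

    Assumption: exact equality is used for matching; the preprint's actual
    comparison rule (e.g. tolerance on diameters) is not described in the abstract.
    """
    return sum(1 for label in llm_labels if label == human_label)


def flag_for_review(annotations, max_score=0):
    """Return (annotation id, score) pairs whose score is at or below max_score."""
    flagged = []
    for ann in annotations:
        score = agreement_score(ann["human"], ann["llms"])
        if score <= max_score:
            flagged.append((ann["id"], score))
    return flagged


# Hypothetical annotations, each with one human label and four LLM outputs.
annotations = [
    {"id": "case1/location", "human": "rectum",
     "llms": ["rectum", "rectum", "rectum", "rectum"]},
    {"id": "case2/location", "human": "sigmoid",
     "llms": ["cecum", "cecum", "cecum", "cecum"]},
]
print(flag_for_review(annotations))
```

With four models the score ranges from 0 to 4, so raising `max_score` gives the graded operating points mentioned in the abstract: a higher cutoff sends more annotations to human review.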

13
Case-level artificial intelligence for multi-photo teledermatology submissions: development and internal validation using patient-submitted dermatology images

Patel, V. P.; Sheth, N.; Patel, A.; Patel, Y.

2026-06-01 dermatology 10.64898/2026.05.21.26353816 medRxiv
Top 0.3%
0.8%

Background: Store-and-forward teledermatology commonly relies on several patient-submitted photographs of the same concern, but most dermatology artificial intelligence models classify single images independently. Objective: To develop and internally validate a case-level diagnostic-support model that aggregates multiple patient-submitted photographs for common dermatologic conditions. Methods: We conducted a retrospective diagnostic-modeling study using the Skin Condition Image Network, a public dataset of deidentified self-taken dermatology images from US adults. We curated 2,336 cases comprising 5,041 images across 10 common inflammatory, allergic, and infectious conditions. Cases were split at the submission level into training, validation, and held-out test sets. Frozen general-purpose and dermatology-specific encoders were compared with image-level classifiers and a gated-attention multiple instance learning model that generated one case-level output from 1-3 images. Results: The strongest image-level baseline, dermatology-specific embeddings with random forest classification, achieved macro/micro ROC-AUCs of 0.797/0.854. Case-level aggregation improved discrimination, with dermatology-specific embeddings plus multiple instance learning achieving mean macro/micro ROC-AUCs of 0.819/0.863 across repeated stratified experiments. The locked final model achieved macro/micro ROC-AUCs of 0.800/0.849 on the held-out test set. Balanced-threshold sensitivity/specificity examples were 0.702/0.688 for eczema and 0.818/0.826 for urticaria. Limitations: Internal validation used a 10-condition subset from a US volunteer dataset; external validation, calibration, subgroup performance analysis, and prospective workflow studies are required. Conclusion: Modeling the teledermatology submission as a multi-image case better reflects asynchronous dermatology workflow than single-image classification. 
The model is preliminary clinician-facing support for structured review and triage, not autonomous diagnosis.

14
Pixel-Based Skin Tone Estimation on Dermoscopy: A Dual-Rater MST Benchmark and Feasibility Study

Kumarasinghe, A.; Bui, V.; Ghanbarzadeh, R.

2026-05-17 health informatics 10.64898/2026.05.13.26353004 medRxiv
Top 0.3%
0.8%

Skin-tone labels are absent from public dermoscopy benchmarks such as the International Skin Imaging Collaboration (ISIC), making it impossible to audit whether clinical AI performs equitably across skin tones. While several recent works estimate skin tone automatically from clinical photography and selfies, we ask whether this approach is feasible on dermoscopy, the primary imaging modality of these benchmarks. To answer this, we make three main contributions. First, we release MST-Derm, a dual-rater Monk Skin Tone (MST) annotation benchmark on 500 ISIC 2018 images. Raters were given an explicit unrateable option for crops where the skin surrounding the lesion was too occluded to label confidently. We find that 60% of images were marked unrateable, yielding a 193-image consensus subset (quadratic-weighted Cohen's Kappa = 0.82). Second, we conduct a systematic feasibility study of three pixel-based MST annotation pipelines spanning the principal families in prior work: palette matching in perceptual colour space, robust colour statistics, and projection to a 1D colorimetric scalar. All three pipelines produce ordinal signal above chance (95% confidence intervals on quadratic-weighted Kappa exclude zero). However, ISIC 2018's extreme light-skin bias leaves 82% of the evaluation set at MST 2, giving a constant "always predict MST 2" baseline an accuracy floor the methods cannot overcome. To separate algorithmic signal from dataset bias, we evaluate on a class-balanced subset. The best method reaches quadratic-weighted Kappa = 0.43 against the trivial baseline of Kappa = 0.00, confirming the signal is genuine. Third, we diagnose this performance ceiling. We trace the bottleneck to two causes: dermoscopy's specialised illumination physically compresses the colour range on which lighter skin tones differ, and ISIC's dataset skew makes standard absolute-accuracy metrics uninformative. 
We conclude that while pixel-based colour features carry real MST signal on dermoscopy, current performance is insufficient for autonomous annotation. We release the benchmark, annotation protocol, all prediction runs, and analysis code to facilitate the development of robust skin-tone estimators, a vital prerequisite for accurately auditing fairness and mitigating bias in dermatological machine learning.
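
The quadratic-weighted Cohen's kappa used throughout the abstract above penalizes disagreements by the squared distance between ordinal categories. The sketch below implements the standard textbook definition; it assumes labels are already mapped to integers 0..k-1 (for example MST 1-10 shifted to 0-9), and the function name and toy ratings are hypothetical rather than taken from the released analysis code.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_classes):
    """Quadratic-weighted Cohen's kappa for ordinal labels in 0..num_classes-1."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b) or num_classes < 2:
        raise ValueError("need two equal-length, non-empty rating lists and >= 2 classes")
    k = num_classes
    observed = [[0] * k for _ in range(k)]  # confusion matrix of raw counts
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2       # quadratic disagreement weight
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n  # chance agreement
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single identical class")
    return 1.0 - weighted_observed / weighted_expected


reference = [0, 1, 2, 3]
constant = [1, 1, 1, 1]  # a predictor that always outputs the same class
print(quadratic_weighted_kappa(reference, reference, 4))
print(quadratic_weighted_kappa(reference, constant, 4))
```

A constant predictor scores zero here because its weighted disagreement equals what chance alone would produce, which is why the abstract can use kappa to separate real signal from the "always predict MST 2" baseline even when raw accuracy cannot.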

15
Deep learning-based recognition model for surgical phases of minimally invasive hysterectomy: A multicentre retrospective study

Koike, R.; Takenaka, S.; Suzuki, Y.; Matsuzaki, H.; Harada, Y.; Nakabayashi, M.; Hirose, Y.; Chikazawa, K.; Shimada, K.; Yoshiizumi, E.; Komatsu, H.; Tanabe, H.; Matsumoto, K.

2026-05-17 obstetrics and gynecology 10.64898/2026.05.13.26353100 medRxiv
Top 0.3%
0.8%

Objective: To develop and validate a robust deep-learning model capable of fine-grained phase recognition in total hysterectomy, particularly the complex periuterine dissection phase. Design: Multicentre retrospective observational study. Setting: Japan. Sample: Surgical videos (n = 764) from 43 institutions. Methods: We developed a robust and generalisable deep-learning model for surgical phase recognition in total hysterectomy, applicable to laparoscopic and robot-assisted procedures. Overall, 1,591,334 still images were annotated across nine surgical phases. A convolutional neural network (Xception architecture) was trained on 200 cases using four-fold cross-validation, with institutional separation between training and testing sets. Main outcome measures: Model performance was assessed using accuracy, precision, recall, and F1 score. Subgroup analysis and logistic regression evaluated the association between background clinical factors and recognition accuracy. Results: The model achieved an overall phase recognition accuracy of 0.78 (95% CI: 0.74-0.80), with a precision of 0.75 (95% CI: 0.72-0.78) and a recall of 0.76 (95% CI: 0.74-0.78). Performance was consistent across laparoscopic and robot-assisted procedures and across most surgical phases. Accuracy plateaued after training on 120 cases. No clinical factors significantly impacted performance. Trends toward lower accuracy were observed for cases with cervical myoma and pouch of Douglas adhesions. Conclusions: This model demonstrated high accuracy across diverse institutions and patient backgrounds. Its potential applications include surgical education, real-time intraoperative support, and training efficiency enhancement.

16
Failure detection in medical image classification under realistic distribution shifts: A large-scale benchmark

Steinmetz, P.; Frouin, F.; Morard, V.; Buvat, I.

2026-05-05 radiology and imaging 10.64898/2026.05.04.26350496 medRxiv
Top 0.3%
0.8%

Medical images (MI) exhibit variability due to different acquisition protocols, devices, and patient populations, making failure detection at inference time essential for reliable deployment of clinical classifiers. As existing evaluations of failure detection methods use different settings, it is difficult to compare results and identify the best strategy, if any. We present a comprehensive benchmark of eight confidence scoring functions and two score-aggregation strategies across eight MI tasks spanning diverse modalities, backbone architectures, training setups, and failure sources. The confidence ranking ability and classification error mitigation are jointly evaluated. While no single method systematically dominated across settings, aggregation of confidence scores consistently matched or approached the best individual method and substantially reduced the silent failure rate. The failure detection performance was strongly correlated with classifier accuracy for all tested settings. These findings provide large-scale evidence regarding the strengths and limitations of confidence scoring strategies and offer actionable guidance for mitigating silent failures under realistic distribution shifts in MI.
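
To make the idea of confidence scoring and score aggregation concrete, here is a minimal sketch under stated assumptions. The abstract does not name its eight scoring functions or its two aggregation strategies, so the two scores used here (maximum softmax probability and negative entropy), the rank-averaging aggregation, and all function names are illustrative choices rather than the benchmark's methods. Ranking ability is measured as the AUROC for separating correct from incorrect predictions.

```python
import math


def max_prob(probs):
    """Confidence as the largest predicted class probability."""
    return max(probs)


def neg_entropy(probs):
    """Confidence as negative Shannon entropy (higher means more peaked)."""
    return sum(p * math.log(p) for p in probs if p > 0)


def rank_normalize(scores):
    """Map scores to [0, 1] by rank; tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2
        i = j + 1
    denom = max(len(scores) - 1, 1)
    return [r / denom for r in ranks]


def aggregate(score_lists):
    """Average rank-normalized scores so that differing scales are comparable."""
    normalized = [rank_normalize(s) for s in score_lists]
    return [sum(column) / len(column) for column in zip(*normalized)]


def failure_auroc(confidence, correct):
    """Probability that a correct prediction outranks an incorrect one."""
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical softmax outputs and correctness flags for four test images.
probs = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.6, 0.3, 0.1], [0.34, 0.33, 0.33]]
correct = [True, False, True, False]
combined = aggregate([[max_prob(p) for p in probs], [neg_entropy(p) for p in probs]])
auroc = failure_auroc(combined, correct)
```

Rank normalization is used here because the two scores live on different scales; other aggregation rules are equally plausible and may be what the benchmark evaluated.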

17
Non-invasive Transcriptomic Cell Profiling of the Human Endometrium with Generative Deep Learning

Meltsov, A.; Falcon-Perez, J. M.; Matorras, R.; Apostolov, A.; Sola-Leyva, A.; Esteki, M. Z.; Salumets, A.; Aleksejeva-Zagura, E.

2026-05-20 obstetrics and gynecology 10.64898/2026.05.18.26352867 medRxiv
Top 0.4%
0.7%

Background: Delineating the cellular origins of extracellular vesicles (EVs) enables the detection of clinically relevant changes in dynamic and complex tissues, such as the endometrium, which are not characterizable through single biomarker assays. Transcriptome deconvolution into cellular composition using deep learning methods provides a means to explore this complexity. However, such computational methods have not been previously applied to EV bulk transcriptomes, and their efficacy in profiling EV population changes and concordance with tissue throughout the menstrual cycle remains unknown. Methods: This observational cross-sectional study utilized a deconvolutional generative deep learning algorithm, BulkTrajBlend, trained on a comprehensive human endometrial single-cell RNA sequencing (scRNA-seq) atlas. The model was applied to deconvolve paired bulk transcriptomes from endometrial tissue and uterine fluid EVs (UF-EVs) across the proliferative (P, n=4), early-secretory (ES, n=5), mid-secretory (MS, n=5), and late-secretory (LS, n=5) phases from healthy, fertile women. To validate generalizability, independent UF-EV datasets (ES, n=12; MS, n=12) obtained via different laboratory protocols were included. Deconvolved pseudo-single-cell (pSC) profiles from UF-EV data were subsequently integrated with Visium spatial transcriptomics slides of human endometrium (P, n=2; MS, n=4; ES, n=2). Results: We developed a foundation model-based approach utilizing self-supervised learning to determine the cellular origin of EVs from their transcriptomic profiles. By mapping the generated pSC profiles to spatial transcriptomic data, we evaluated spatial origins of EVs. The statistical analysis demonstrated that UF-EV transcriptome deconvolution reflects the dynamic changes in the cellular composition of endometrial tissue across the menstrual cycle phases. The ability to distinguish accurately between proliferative and decidualizing menstrual cycle phases (ROC-AUC = 0.98) using the cellular profile of the deconvolved UF-EV transcriptome enables non-invasive profiling of endometrial tissue. Conclusions: Our findings indicate the feasibility of determining endometrial tissue cellular composition using UF-EV transcriptomics. This methodology enables refined, non-invasive endometrial testing, avoiding invasive biopsy procedures. Based on deconvolution results, we are able to correlate UF-EV content with tissue and distinguish between menstrual cycle phases. These results build toward a multifactorial screening method for abnormalities within the endometrium.

18
A Multimodal Neural Network Model for Early Recurrence Prediction in Lung Adenocarcinoma

Patricoski-Chavez, J. A.; Hayek, K.; Singh, R.; Azzoli, C. G.; Warner, J. L.; Gamsiz Uzun, E. D.

2026-05-18 bioinformatics 10.64898/2026.05.14.725244 medRxiv
Top 0.4%
0.7%

Lung adenocarcinoma (LUAD), a subtype of non-small cell lung cancer (NSCLC), is the most common primary lung cancer worldwide. Despite advancements in early detection and treatment, up to 39% of patients develop recurrent tumors following complete resection. Currently, no widely available models exist for reliably predicting early recurrence of LUAD, which is a significant prognostic factor of post-recurrence survival. Models leveraging deep learning (DL) techniques have demonstrated notable utility in cancer recurrence prediction, particularly when used in combination with both clinical and genomic data. We developed a DL-based model, Predicting Lung Adenocarcinoma recurrence via Selective Multimodal Attention (PLASMA), to predict early recurrence using clinical, mRNA expression, and mutation data from patients with primary stage I-III LUAD. Trained on The Cancer Genome Atlas (TCGA) dataset, PLASMA outperformed traditional machine learning models in predicting early recurrence in both the TCGA test set and an external validation set (TRACERx Lung), achieving area under the receiver operating characteristic curve (AUROC) scores of 85.0% and 76.5%, respectively. Our results support the potential of multimodal DL for early LUAD recurrence prediction and risk stratification.

19
AI Decision Support for Challenging Teledermatology Cases: MedGemma Performance in the Dermatology ECHO Program

Appiagyei, J. B.; Otu, R. O.; Henry, M. K.; Casterline, B. W.; Becevic, M.

2026-05-26 health informatics 10.64898/2026.05.21.26353523 medRxiv
Top 0.4%
0.7%

Teledermatology expands access to dermatologic expertise in rural settings, yet diagnostic uncertainty persists in low-resource primary care. This retrospective study evaluated MedGemma-4B-IT, a compact multimodal vision-language model, as adjunctive clinical decision support for challenging diagnostic cases. We analyzed 77 zero-concordance cases (360 clinical photographs) from a Dermatology Extension for Community Healthcare Outcomes (ECHO) tele-mentoring program (2016-2021). Zero-concordance cases showed no overlap between the primary clinician's provisional diagnosis and the dermatologist-confirmed diagnosis. The model was prompted using a dermatologist-style format to generate ranked differential diagnoses. Performance was assessed using strict case-level top-k exact-match accuracy and relaxed matching criteria based on fuzzy string similarity. MedGemma achieved 0.0% strict top-1 accuracy, 1.3% top-3 accuracy, 3.9% top-5 accuracy, and 3.9% top-10 accuracy. Relaxed concept-level matching achieved 28.6% top-1, 63.6% top-5, and 67.5% top-10 accuracy. Image-level accuracy was 44.2% (159/360, 95% CI 39.0-49.5%). The model surfaced the correct diagnosis within differential lists in 45.5% of cases despite no exact top-1 matches, suggesting utility for differential expansion rather than definitive diagnosis. Performance varied across diagnostic categories, with the highest accuracy in Other categories (54.5%) and the lowest in neoplastic conditions (0.0%). Common errors included confusion between inflammatory and other diagnostic groupings. These findings characterize MedGemma performance on real-world teledermatology cases and inform safe, clinician-in-the-loop integration into teledermatology workflows where specialist oversight remains essential.
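
The gap between strict and relaxed top-k accuracy reported in this abstract can be illustrated with a minimal sketch. The abstract says only that relaxed matching was "based on fuzzy string similarity", so the use of Python's standard-library `difflib.SequenceMatcher`, the 0.6 threshold, the helper names, and the example diagnoses below are all assumptions made for illustration, not the study's actual criteria.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace before comparing diagnoses."""
    return " ".join(text.lower().split())


def is_match(pred, truth, threshold=None):
    """Exact match when threshold is None, otherwise fuzzy similarity match."""
    pred, truth = normalize(pred), normalize(truth)
    if threshold is None:
        return pred == truth
    return SequenceMatcher(None, pred, truth).ratio() >= threshold


def top_k_accuracy(cases, k, threshold=None):
    """Fraction of cases whose truth matches any of the first k predictions."""
    hits = sum(
        any(is_match(pred, truth, threshold) for pred in ranked[:k])
        for ranked, truth in cases
    )
    return hits / len(cases)


# Hypothetical (ranked differential, confirmed diagnosis) pairs.
cases = [
    (["eczema", "psoriasis vulgaris", "tinea corporis"], "psoriasis"),
    (["Tinea corporis", "nummular eczema"], "tinea corporis"),
    (["acne vulgaris", "rosacea"], "lichen planus"),
]

strict_top1 = top_k_accuracy(cases, k=1)
strict_top3 = top_k_accuracy(cases, k=3)
relaxed_top3 = top_k_accuracy(cases, k=3, threshold=0.6)  # threshold is an assumption
```

Here "psoriasis vulgaris" fails the exact-match test against "psoriasis" but passes the fuzzy test, which is the kind of case that separates the two metrics. String similarity alone does not capture clinical synonyms, so true concept-level matching would need more than this.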

20
Multi-Scale Tri-Modal Histology Dataset Integrating Tumor Morphology, Immune Patterns, and Clinical Outcomes

Jung, K. J.; Qiu, J.; Cho, S.; McDonough, E.; Chadwick, C.; Ghose, S.; West, R. B.; Brooks, J. D.; Ginty, F.; Machiraju, R.; Mallick, P.

2026-05-19 bioinformatics 10.64898/2026.05.15.725535 medRxiv
Top 0.4%
0.7%

Accurate prognostic assessment of prostate cancer (PCa) requires an integrated understanding of tissue morphology (encompassing cell structure, glandular architecture, and tissue organization) and the immune environment. We present Prostate-TriMod, a novel tri-modal histology dataset designed to integrate high-resolution visual morphology with spatial tissue maps, immune infiltration patterns, and clinical outcomes. This dataset, generated from the Cell DIVE multiplexed imaging platform, consists of three synchronized modalities: (1) multiscale virtual H&E tiles (224px, 256px, 512px, and 2040px) providing visual morphological context, (2) spatial tissue maps identifying cancerous/non-cancerous epithelial cells, stroma and immune cell populations (via TOPAZ and CAT models), and (3) text captions generated from single-cell data and patterns. The dataset includes comprehensive clinical annotations, including Grade Groups and biochemical recurrence (BCR) status. By providing high-fidelity alignment between visual features, spatial tissue maps, and textual descriptions, Prostate-TriMod empowers the development of advanced multimodal AI frameworks. We expect this resource to support reuse in multimodal representation learning, spatial analysis, and benchmarking studies that link histology morphology and immune context to clinical outcomes in prostate cancer.